Overview
Dataset statistics
| Number of records | 2703242 |
| Distinct trips | 1351621 |
| Number of complete trips (start and end point) | 1351621 |
| Number of incomplete trips (single point) | 0 |
| Distinct users | 360619 |
| Distinct locations (lat & lon combination) | 200063 |
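As a rough illustration of how counts like these could be derived, here is a minimal sketch. It assumes each record is a `(uid, tid, datetime, lat, lng)` tuple and that a complete trip consists of exactly two records (start and end point); the toy records and the function name `dataset_statistics` are hypothetical and not taken from the report's implementation.

```python
from collections import Counter

# Hypothetical toy records: (uid, tid, datetime, lat, lng).
records = [
    ("u1", "t1", "2018-04-18 08:00", 52.50, 13.40),
    ("u1", "t1", "2018-04-18 08:30", 52.52, 13.41),
    ("u2", "t2", "2018-04-19 09:00", 52.50, 13.40),
    ("u2", "t2", "2018-04-19 09:20", 52.48, 13.35),
    ("u2", "t3", "2018-04-19 18:00", 52.48, 13.35),  # single point: incomplete trip
]

def dataset_statistics(records):
    """Compute the overview counts from a list of point records."""
    points_per_trip = Counter(tid for _, tid, _, _, _ in records)
    return {
        "records": len(records),
        "distinct_trips": len(points_per_trip),
        # Assumption: a complete trip has exactly two points.
        "complete_trips": sum(1 for n in points_per_trip.values() if n == 2),
        "incomplete_trips": sum(1 for n in points_per_trip.values() if n == 1),
        "distinct_users": len({uid for uid, *_ in records}),
        "distinct_locations": len({(lat, lng) for *_, lat, lng in records}),
    }

stats = dataset_statistics(records)
```

In the report above, the number of records is exactly twice the number of distinct trips, which is consistent with every trip having both a start and an end point.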
Additional variable
None
Missing values
| User ID (uid) | 0 |
| Trip ID (tid) | 0 |
| Timestamp (datetime) | 0 |
| Latitude (lat) | 0 |
| Longitude (lng) | 0 |
TODO: don't show exact values as they might be misleading. Maybe, e.g., with 99% chance there are no missing values (?)
The dataset should not contain any missing values, except potentially in the additional variable.
Missing values in any other column indicate a faulty dataset.
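A per-column check of this kind can be sketched as follows. The sketch assumes rows are dictionaries keyed by the column names listed above and treats both `None` and `NaN` as missing; the function name `count_missing` is hypothetical.

```python
import math

FIELDS = ("uid", "tid", "datetime", "lat", "lng")

def count_missing(rows):
    """Count missing entries (None or NaN) per field."""
    counts = dict.fromkeys(FIELDS, 0)
    for row in rows:
        for field in FIELDS:
            value = row.get(field)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[field] += 1
    return counts

# Hypothetical rows: the second one lacks a latitude, the third has a NaN longitude.
rows = [
    {"uid": "u1", "tid": "t1", "datetime": "2018-04-18 08:00", "lat": 52.5, "lng": 13.4},
    {"uid": "u1", "tid": "t1", "datetime": "2018-04-18 08:30", "lat": None, "lng": 13.4},
    {"uid": "u2", "tid": "t2", "datetime": "2018-04-19 09:00", "lat": 52.5, "lng": float("nan")},
]
missing = count_missing(rows)
```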
Temporal properties
Number of trips over time
Distribution
| Min. | 2018-04-18 |
| 25% | 2018-04-19 |
| Median | 2018-04-19 |
| 75% | 2018-04-19 |
| Max. | 2018-04-20 |
Number of trips per weekday
Number of trips per hour of day split by weekday and weekend
Place analysis
Visits per tile
TODO: fix bug that all legends appear in first map
Points outside the given tessellation: 18
The following statistics (mean, min., max. and quartiles) describe how the visits are distributed over the tiles:
whether some tiles are visited more often than others or whether the visits are spread equally over all tiles.
Distribution
| Min. | 16.0 |
| 25% | 2669.0 |
| Median | 5377.0 |
| 75% | 9905.0 |
| Max. | 46572.0 |
A different way of visualizing the distribution of visits over tiles is the cumulative sum of all visits: If only a few tiles account for most of the visits, the curve rises steeply at the beginning and flattens at the end.
If the visits are distributed equally over all tiles, the line is a straight diagonal.
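The shape of such a curve can be reproduced with a short sketch. It assumes the tiles are sorted by descending visit count before accumulating, which is what produces the steep start for skewed data; the function name `cumulative_share` and the example counts are hypothetical.

```python
def cumulative_share(counts):
    """Cumulative share of the total, with items sorted by descending count."""
    total = sum(counts)
    shares, running = [], 0
    for count in sorted(counts, reverse=True):
        running += count
        shares.append(running / total)
    return shares

# Equal distribution: the curve is a straight diagonal.
equal = cumulative_share([10, 10, 10, 10])
# Skewed distribution: one tile holds almost all visits, so the curve starts steep.
skewed = cumulative_share([1, 97, 1, 1])
```

The same function applies unchanged to flow counts per OD connection.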
Ranking most frequently visited tiles
| 1 | Kantstraße / Schillerstraße (Id: 110202420): 46572 |
| 2 | Schönhauser Allee / Bornholmer Straße (Id: 110510620): 28672 |
| 3 | Behmstraße / Stettiner Straße (Id: 110500720): 27970 |
| 4 | Hausvogteiplatz (Id: 110110320): 26084 |
| 5 | Friedrich-Wilhelm-Platz (Id: 110506110): 23276 |
| 6 | Südstern (Id: 110401620): 23030 |
| 7 | Osloer Straße / Seestraße (Id: 110500920): 22992 |
| 8 | Sonnenallee / Hobrechtstraße (Id: 110407510): 22812 |
| 9 | Kantstraße / Suarezstraße (Id: 110402410): 22742 |
| 10 | Müllerstraße / Seestraße (Id: 110500910): 22558 |
Visits per tile and time window
Weekday: absolute count
Weekday: deviation from average
Origin-destination (OD) analysis
OD flows between tiles
TODO: fix legend appearing in wrong chart
Intra-tile flows
The number and percentage of flows that start and end within the same tile.
218700.0 (16.18 %) of flows start and end within the same tile.
A large number of intra-tile flows indicates either round trips
(e.g., going for a run that starts and ends at the home location)
or a tessellation that is too coarse to properly capture flows.
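The intra-tile share is a simple ratio, sketched below under the assumption that each trip has been reduced to an `(origin_tile, destination_tile)` pair; the function name `intra_tile_share` and the toy flows are hypothetical. The last line re-derives the reported percentage from the reported count and the number of distinct trips.

```python
def intra_tile_share(flows):
    """Return (count, percentage) of flows whose origin equals their destination."""
    flows = list(flows)
    intra = sum(1 for origin, destination in flows if origin == destination)
    return intra, 100 * intra / len(flows)

# Hypothetical toy flows between tiles A, B and C.
toy_flows = [("A", "A"), ("A", "B"), ("B", "C"), ("C", "C"), ("B", "A")]
count, percent = intra_tile_share(toy_flows)

# Cross-check of the reported figures: 218700 intra-tile flows out of 1351621 trips.
reported_percent = round(100 * 218700 / 1351621, 2)
```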
Distribution
| Mean | 13.83 |
| Min. | 1.0 |
| 25% | 2.0 |
| Median | 4.0 |
| 75% | 10.0 |
| Max. | 4425.0 |
A different way of visualizing the distribution of the number of trips per flow is the cumulative sum over all flows:
If only a few flows account for most of the trips, the curve rises steeply at the beginning and flattens at the end.
If the trips are distributed equally over all flows, the line is a straight diagonal.
Most frequent OD connections
Ranking most frequent OD connections
| 1 | Kantstraße / Schillerstraße - Kantstraße / Schillerstraße: 4425.0 |
| 2 | Wilhelmsruher Damm / Senftenberger Ring - Wilhelmsruher Damm / Senftenberger Ring: 3146.0 |
| 3 | Friedrich-Wilhelm-Platz - Friedrich-Wilhelm-Platz: 2819.0 |
| 4 | Frankfurter Allee / Petersburger Straße - Frankfurter Allee / Petersburger Straße: 2460.0 |
| 5 | Friedrichshagener Straße / Bahnhofstraße - Friedrichshagener Straße / Bahnhofstraße: 2365.0 |
| 6 | Schönhauser Allee / Bornholmer Straße - Schönhauser Allee / Bornholmer Straße: 2307.0 |
| 7 | Müggelseedamm / Fürstenwalder Damm - Müggelseedamm / Fürstenwalder Damm: 2293.0 |
| 8 | Sonnenallee / Hobrechtstraße - Sonnenallee / Hobrechtstraße: 2010.0 |
| 9 | Wendenschloßstraße / Salvador-Allende-Straße - Wendenschloßstraße / Salvador-Allende-Straße: 1968.0 |
| 10 | Südstern - Südstern: 1951.0 |
Trip statistics
Travel time of trips (in minutes)
24354 outliers have been excluded.
Outliers are values above 90
Distribution
| Min. | 4.0 |
| 25% | 16.0 |
| Median | 27.0 |
| 75% | 43.0 |
| Max. | 90.0 |
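The travel time and the outlier cut-off can be sketched as follows. The sketch assumes each trip is available as a `(start, end)` pair of `datetime` objects and that a value of exactly 90 minutes is kept, since the report describes outliers as values above 90; the function name `travel_times` is hypothetical.

```python
from datetime import datetime

MAX_MINUTES = 90  # cut-off stated in the report

def travel_times(trips, max_minutes=MAX_MINUTES):
    """Return (kept travel times in minutes, number of excluded outliers)."""
    minutes = [(end - start).total_seconds() / 60 for start, end in trips]
    kept = [m for m in minutes if m <= max_minutes]
    return kept, len(minutes) - len(kept)

# Hypothetical trips lasting 27, 90 and 120 minutes.
trips = [
    (datetime(2018, 4, 18, 8, 0), datetime(2018, 4, 18, 8, 27)),
    (datetime(2018, 4, 18, 8, 0), datetime(2018, 4, 18, 9, 30)),
    (datetime(2018, 4, 18, 8, 0), datetime(2018, 4, 18, 10, 0)),
]
kept, n_outliers = travel_times(trips)
```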
Jump length (in meters)
58818 outliers have been excluded.
Outliers are values above 15000
Distribution
| Min. | 0.0 |
| 25% | 834.19 |
| Median | 3082.89 |
| 75% | 5977.36 |
| Max. | 14999.65 |
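The report does not state how the jump length is computed. One common choice for the distance between a trip's start and end point is the haversine (great-circle) distance, sketched here as an assumption; the function name `haversine_m` and the Earth radius constant are not taken from the report.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# A trip that starts and ends at the same point has a jump length of 0,
# which matches the minimum in the table above.
same_point = haversine_m(52.5, 13.4, 52.5, 13.4)
# One degree of latitude is roughly 111 km.
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
```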
User analysis
Number of trajectories per user
Distribution
| Min. | 2.0 |
| 25% | 2.0 |
| Median | 4.0 |
| 75% | 5.0 |
| Max. | 16.0 |
Time between two consecutive trajectories of a user
How much time passes between two consecutive trajectories?
This information gives insight into the temporal density of the dataset.
Trajectories might follow each other directly; in that case the time in between is only as long as the stay duration at that place.
If the trips are only collected sparsely, there might be days between single trajectories of a user.
This analysis is based on the assumption that trips of a user follow each other consecutively and do not overlap,
i.e., a trip cannot start before the previous one has ended.
Therefore, we first perform a plausibility check to ensure that no user trips overlap.
Overlapping trips might indicate a faulty dataset.
Plausibility check: overlapping user trips
There are 25815 cases where the start time of the following trajectory precedes the previous end time.
If there are overlapping trips present in the dataset the minimum time between trajectories will be negative.
Distribution
| Min. | 0 days 00:00:00 |
| 25% | 0 days 00:17:00 |
| Median | 0 days 01:15:00 |
| 75% | 0 days 03:40:00 |
| Max. | 0 days 22:21:00 |
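The plausibility check and the gap computation can be sketched together. The sketch assumes the trips of a single user are given as `(start, end)` pairs of `datetime` objects and are ordered by start time; a negative gap then marks an overlap. The function name `time_between_trips` is hypothetical.

```python
from datetime import datetime, timedelta

def time_between_trips(trips):
    """Return (gaps between consecutive trips, number of overlapping pairs) for one user."""
    ordered = sorted(trips)  # sort by start time (then end time)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    overlaps = sum(1 for gap in gaps if gap < timedelta(0))
    return gaps, overlaps

# Hypothetical trips of one user; the second trip starts before the first has ended.
trips = [
    (datetime(2018, 4, 18, 10, 0), datetime(2018, 4, 18, 10, 30)),
    (datetime(2018, 4, 18, 8, 0), datetime(2018, 4, 18, 8, 20)),
    (datetime(2018, 4, 18, 8, 15), datetime(2018, 4, 18, 9, 0)),
]
gaps, overlaps = time_between_trips(trips)
```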
Radius of gyration
The radius of gyration is the characteristic distance traveled by an individual during a period of time.
4472 outliers have been excluded.
Outliers are values above 15000
Distribution
| Min. | 0.0 |
| 25% | 1380.69 |
| Median | 2571.81 |
| 75% | 4414.05 |
| Max. | 9999.99 |
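A common definition of the radius of gyration is the root mean square distance of a user's visited points from their center of mass. The sketch below follows that definition as an assumption: it uses the arithmetic mean of the coordinates as the center of mass and the haversine distance in meters; neither choice is confirmed by the report, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def radius_of_gyration(points):
    """Root mean square distance (meters) of (lat, lng) points from their center of mass."""
    n = len(points)
    center_lat = sum(lat for lat, _ in points) / n
    center_lng = sum(lng for _, lng in points) / n
    squared = [haversine_m(lat, lng, center_lat, center_lng) ** 2 for lat, lng in points]
    return math.sqrt(sum(squared) / n)

# A user who only ever visits one location has a radius of gyration of (about) 0.
stationary = radius_of_gyration([(52.5, 13.4)] * 3)
# Two points one degree north and south of the equator: both are ~111 km from the center.
symmetric = radius_of_gyration([(1.0, 0.0), (-1.0, 0.0)])
```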
Location entropy
Location entropy (based on Shannon entropy) captures the diversity of users visiting a location.
If most trips to a certain location originate from a single user (or a few users), the entropy is low.
A high entropy suggests that the place is visited evenly by diverse users. A dataset with many cells that have high visit counts but low entropy suggests that single users drive certain mobility patterns that might not be representative of other users.
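As a minimal sketch of the idea, the Shannon entropy of a location can be computed from the list of user IDs of its visits. The base-2 logarithm and the function name `location_entropy` are assumptions; the report does not state which base or normalization it uses.

```python
import math
from collections import Counter

def location_entropy(visitor_ids):
    """Shannon entropy (base 2) of the distribution of visits over users."""
    counts = Counter(visitor_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All visits come from one user: lowest possible entropy.
single_user = location_entropy(["u1", "u1", "u1", "u1"])
# Visits are spread evenly over four users: highest entropy for four visitors.
four_users = location_entropy(["u1", "u2", "u3", "u4"])
```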
Number of distinct tiles per user
TODO: clean outliers - or not necessary as data set is already truncated?
How many different tiles does a single user visit?
Distribution
| Min. |
1.0 |
| 25% |
2.0 |
| Median |
3.0 |
| 75% |
3.0 |
| Max. |
12.0 |
Uncorrelated Entropy
The temporal-uncorrelated entropy characterizes the heterogeneity of a user's visitation patterns (taking into account the historical probability that a location was visited by the user).
TODO: clean outliers - or not necessary for uncorrelated entropy, as it is normalized?
Distribution
| Min. | 0.0 |
| 25% | 0.92 |
| Median | 0.96 |
| 75% | 1.0 |
| Max. | 1.0 |
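The values above lie between 0 and 1, which fits a normalized entropy. A minimal sketch, assuming normalization by the logarithm of the number of distinct locations a user visited and a value of 0 for users with a single location; the function name `uncorrelated_entropy` is hypothetical and the exact normalization used by the report is not confirmed.

```python
import math
from collections import Counter

def uncorrelated_entropy(visited_tiles, normalize=True):
    """Shannon entropy of one user's visit distribution over tiles."""
    counts = Counter(visited_tiles)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if normalize:
        n_distinct = len(counts)
        # Assumption: a user with a single distinct tile gets entropy 0.
        return entropy / math.log2(n_distinct) if n_distinct > 1 else 0.0
    return entropy

even = uncorrelated_entropy(["a", "b"])        # visits spread evenly over two tiles
single = uncorrelated_entropy(["a", "a", "a"])  # only one tile ever visited
uneven = uncorrelated_entropy(["a", "a", "b"])  # one tile visited twice as often
```

With only a handful of visits per user, as in this dataset, most users end up close to the maximum because there is little room for a skewed visit distribution.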
Real Entropy (including sequence of locations)
TODO: clean outliers - or not necessary for real entropy?
None
None